-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Module/igv/1.0 #303
base: master
Are you sure you want to change the base?
Module/igv/1.0 #303
Conversation
… alleles in one position are incorporated into the batch scripts
…(expanded genomic coordinate output)
modules/igv/1.0/config/default.yaml
Outdated
igv: | ||
|
||
inputs: | ||
# Available wildcards: {unix_group} {seq_type} {tumour_sample_id} {normal_sample_id} {pair_status} {genome_build} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to have {unix_group}
as a wildcard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a standard column in the lcr-modules schema and is something specific to gambl. There are ways to deal with this - can show some examples and we can talk about this more before lab meeting today 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!!
I am mostly wondering if we can reuse existing conda envs and scripts to have more standard approach and decrease the maintenance burden 😄
This looks great, long way since the early version! 🚀
maf: "__UPDATE__" | ||
|
||
regions: | ||
# Provide regions files as lists in their respective genome builds so that liftover of coordinates occurs properly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if nothing is specified here? Can we add the "__UPDATE__"
to anything that has to be filled in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an "__UPDATE__"
string and specified that at least one regions file must be provided
maf: | ||
grch37: [] | ||
hg38: [] | ||
mutation_id: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add here the example formatting of what is expected in that file? Does it have to have a header and expects certain column names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done ! I added what column and format is required for mutation_id file format
modules/igv/1.0/config/default.yaml
Outdated
options: | ||
genome_map: | ||
# Map metadata builds to grch37 and hg38 so that MAF file locations are determined correctly | ||
grch37: ["__UPDATE__"] # e.g ["grch37","hg19","hs37d5"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you sort of know this upfront, so maybe it is better to fill in these keys and then the comment will be that you can add other genome builds to this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done ✅
modules/igv/1.0/config/default.yaml
Outdated
|
||
liftover_regions: | ||
reference_chain_file: | ||
grch37: "genomes/grch37/chains/grch37/hg19ToHg38.over.chain" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is generated with the reference files so you can directly request the reference files output instead of having this listed here. This can be an example:
lcr-modules/modules/liftover/2.0/liftover.smk
Lines 108 to 130 in 7b495a3
def get_chain(wildcards): | |
if "38" in str({wildcards.genome_build}): | |
return reference_files("genomes/{genome_build}/chains/grch38/hg38ToHg19.over.chain") | |
else: | |
return reference_files("genomes/{genome_build}/chains/grch37/hg19ToHg38.over.chain") | |
# Convert the bed file in hg38 coordinates into hg19 coordinates | |
rule _run_liftover: | |
input: | |
native = rules._liftover_convert_2_bed.output.bed, | |
chains = get_chain | |
output: | |
lifted = temp(CFG["dirs"]["liftover"] + "from--{seq_type}--{genome_build}/{tumour_id}--{normal_id}--{pair_status}.{tool}--{type}.lifted_{chain}.bed"), | |
unmapped = CFG["dirs"]["liftover"] + "from--{seq_type}--{genome_build}/{tumour_id}--{normal_id}--{pair_status}.{tool}--{type}.lifted_{chain}.unmapped.bed" | |
log: | |
stderr = CFG["logs"]["liftover"] + "from--{seq_type}--{genome_build}/{tumour_id}--{normal_id}--{pair_status}.{tool}--{type}.lifted_{chain}.stderr.log" | |
params: | |
mismatch = CFG["options"]["min_mismatch"] | |
conda: | |
CFG["conda_envs"]["liftover-366"] | |
wildcard_constraints: | |
chain = "hg38ToHg19|hg19ToHg38" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done ✅
modules/igv/1.0/config/default.yaml
Outdated
run_unpaired_tumours_with: "unmatched_normal" | ||
run_paired_tumours_as_unpaired: False | ||
|
||
slms_3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you instead just include config from slms-3?
I think this will conflict if you have snakefile that will attempt to run both slms-3 and this module because snakemake won't like duplicated config values - or will just use whatever is imported last so may have unexpected consequence in that scenario
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good idea, I removed this part from the config
bam = CFG["dirs"]["inputs"] + "bams/{seq_type}/{sample_id}.bam", | ||
bai = CFG["dirs"]["inputs"] + "bams/{seq_type}/{sample_id}.bam.bai" | ||
run: | ||
op.absolute_symlink(input.bam, output.bam) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious if this needs crai index whe the bam is a cram 😃
input: | ||
maf = get_maf | ||
output: | ||
maf = CFG["dirs"]["inputs"] + "maf/{seq_type}--{genome_build}/{tumour_id}--{normal_sample_id}--{pair_status}.maf" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be missing the sample
in the tumour_id wildcard?
Here is the part from config:
# Available wildcards: {unix_group} {seq_type} {tumour_sample_id} {normal_sample_id} {pair_status} {genome_build}
modules/igv/1.0/igv.smk
Outdated
script: | ||
config["lcr-modules"]["igv"]["scripts"]["format_regions"] | ||
|
||
REGIONS_FORMAT = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a dictionary if everything is mapped to the same key bed
? Maybe we can just refer to bed regardless of what is the region format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right- I removed it
script: | ||
config["lcr-modules"]["igv"]["scripts"]["filter_script"] | ||
|
||
def _get_maf(wildcards): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why this does not conflict with the function defined above?
def get_maf(wildcards):
unix_group = config["unix_group"]
return expand(config["lcr-modules"]["igv"]["inputs"]["maf"], allow_missing=True, unix_group=unix_group)
Just because it is defined later the further call will use the "latest" definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the earlier function because I do not need it to fill in unix_group value (this can be done from the snakefile that is used to launch the run), so not an issue anymore
modules/igv/1.0/igv.smk
Outdated
|
||
rule _igv_download_igv: | ||
output: | ||
igv_zip = CFG["dirs"]["igv"] + "IGV_2.7.2.zip", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone doesn't run this on linux or wants to have a different version? Maybe this part should be configurable or better even use the conda instead of downloading the source file?
https://anaconda.org/bioconda/igv
I have addressed most of the comments above, but I need to further investigate regarding:
plus i have made added some more changes since, so I will ask for an updated review once I'm finished |
Pull Request Checklists
Important: When opening a pull request, keep only the applicable checklist and delete all other sections.
Checklist for New Module
Required
I used the cookiecutter template and updated the placeholder rules.
The snakemake rules follow the design guidelines.
rules
object (e.g. for input files) are wrapped withstr()
.Every rule in the module is either listed under
localrules
or has thethreads
andresources
directives.Input and output files are being symlinked into the
CFG["inputs"]
andCFG["outputs"]
subdirectories, respectively.I grouped the input symlinking rule to the next job that uses the input files.
I updated the final target rule (
*_all
) to include every output rule.I explained important module design decisions in
CHANGELOG.md
.I tested the module on real data for all supported
seq_type
values.I updated the
default.yaml
configuration file to provide default values for each rule in the module snakefile.I did not set any global wildcard constraints. Any/all wildcard constraints are set on a per-rule basis.
I ensured that all symbolic links are relative and self-contained (i.e. do not point outside of the repository).
I replaced every value that should (or might need to) be updated in the default configuration file with
__UPDATE__
.I recursively searched for all comments containing
TODO
to ensure none were left. For example:If applicable
I added more granular output subdirectories.
I added rules to the
reference_files
workflow to generate any new reference files.I added subdirectories with large intermediate files to the list of
scratch_subdirectories
in thedefault.yaml
configuration file.I updated the list of available wildcards for the input files in the
default.yaml
configuration file.Checklist for Updated Module
Important! If you are updating the module version, ensure the previous version of the module is restored from master.
If you want to restore a deleted file or directory from the remote master, you can use
git checkout origin/master path/to/file
,then a
git commit
will ensure that file is tracked on your branch again.Example: